library(foreign)
library(car)
## Loading required package: carData
## Registered S3 methods overwritten by 'car':
##   method                          from
##   influence.merMod                lme4
##   cooks.distance.influence.merMod lme4
##   dfbeta.influence.merMod         lme4
##   dfbetas.influence.merMod        lme4
library(nnet)
library(ggplot2)
library(reshape2)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:car':
## 
##     recode
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(caret)
## Loading required package: lattice
library(yardstick)
## For binary classification, the first factor level is assumed to be the event.
## Use the argument `event_level = "second"` to alter this as needed.
## 
## Attaching package: 'yardstick'
## The following objects are masked from 'package:caret':
## 
##     precision, recall, sensitivity, specificity

1. Introduction

In recent years, California has seen record-breaking numbers of wildfires. These wildfires impact wildlife and residents and have serious environmental implications, driven in part by heightened CO2 levels, rising temperatures, and changing precipitation. Previous exploratory analysis showed that lower soil moisture and rainfall were associated with larger fires between 1992 and 2015, and with more wildfires caused by equipment use, while higher rainfall was associated with a higher chance that a wildfire was caused by lightning. Government spending appeared to have only a marginal impact on suppressing wildfires in California.

With climate change an increasingly pressing issue, this research studies how wildfires in California have changed over time. Specifically, we focused on wildfires from 1992 to 2013 and investigated factors that may have significantly affected the intensity or volume of wildfires in California over the years.

The first dataset we found detailed 24 years of nationwide wildfires, from 1992 to 2015, along with their sizes, causes, times of occurrence, and other metadata. Analyzing every state's wildfire patterns would have been an unfocused effort, so we concentrated on California, which experiences more wildfires than any other state.

However, we needed more data for a deeper analysis, so we turned to Cal-Adapt, a project developed by the Geospatial Innovation Facility at the University of California, Berkeley that provides data portraying climate change in California. From it we collected California's daily air temperature, soil moisture, and rainfall, which we combined with the wildfire data by aggregating monthly averages. We also added California's yearly fire suppression budget to the combined dataset, since we were curious how it factored into the volume and intensity of wildfires over the years. Combining and cleaning the data took some work, but it yielded all the data we needed for the analysis.

library(dplyr)
data_wildfire=read.csv('final_wildfire.csv')
data_large=subset(data_wildfire, FIRE_SIZE_CLASS != 'A' & FIRE_SIZE_CLASS != 'B')
temp3=data_large %>% count(Year, DISCOVERY_DOY)
rainfall <- read.csv("rainfall_daily.csv")
soilmoisture <- read.csv("soilmoisture_daily.csv")
airtemp <- read.csv("airtemp_daily.csv")
airtemp$month <- strftime(airtemp$time, "%m")
airtemp$year <- strftime(airtemp$time, "%Y")
airtemp$DOY <- strftime(airtemp$time, "%j")
rainfall$year <- strftime(rainfall$time, "%Y")
rainfall$DOY <- strftime(rainfall$time, "%j")

cleaned_rainfall <- subset(rainfall, select = -c(time))
str(airtemp)
## 'data.frame':    23376 obs. of  5 variables:
##  $ time               : chr  "1950-01-01 00:00:00+00:00" "1950-01-02 00:00:00+00:00" "1950-01-03 00:00:00+00:00" "1950-01-04 00:00:00+00:00" ...
##  $ tair_day_livneh_vic: num  4.14 1.77 -3.09 -3.59 -2.75 ...
##  $ month              : chr  "01" "01" "01" "01" ...
##  $ year               : chr  "1950" "1950" "1950" "1950" ...
##  $ DOY                : chr  "001" "002" "003" "004" ...
cleaned_airtemp <- subset(airtemp, select = -c(time))
soilmoisture$year <- strftime(soilmoisture$time, "%Y")
soilmoisture$DOY <- strftime(soilmoisture$time, "%j")

cleaned_soilmoisture <- subset(soilmoisture, select = -c(time))
joined1 <- merge(cleaned_airtemp, cleaned_soilmoisture, by.x=c("year",'DOY'), by.y=c("year",'DOY'))
joined2 <- merge(joined1, cleaned_rainfall, by.x=c("year",'DOY'), by.y=c("year",'DOY'))
joined2$year=as.integer(joined2$year)
joined2$DOY=as.integer(joined2$DOY)
temp=aggregate(tair_day_livneh_vic~Year+month,data=data_wildfire,mean)
temp1=aggregate(soilmoist1_day_livneh_vic~Year+month,data=data_wildfire,mean)

temp2=aggregate(rainfall_day_livneh_vic~Year+month,data=data_wildfire,mean)

joined3=merge(temp,temp1,by.x=c('Year','month'),by.y=c('Year','month'))
joined3=merge(joined3,temp2,by.x=c('Year','month'),by.y=c('Year','month'))
temp1=data_wildfire %>% count(Year, month)

temp2=aggregate(FIRE_SIZE~Year+month,data=data_wildfire,mean)
joined4 = merge(temp2, temp1, by.x=c('Year', 'month'), by.y=c('Year', 'month'))
joined4 = merge(joined3, joined4, by.x=c('Year','month'), by.y=c('Year', 'month'))

2. Basic EDA of wildfires in California

The initial exploratory data analysis made it clear that most fires in California were associated with high average air temperatures, low soil moisture, and periods of low rainfall. To identify factors that may drive future wildfires, the following questions need to be addressed:

  • How strong is the relationship between California fires and the average rainfall, air temperature, and soil moisture?
  • Does California’s yearly fire suppression budget play any role in mitigating wildfires?
  • How strongly can we predict large (Class C and above) wildfires using rainfall, air temperature, and soil moisture data?
  • Can long-term wildfire characteristics be predicted with reasonable accuracy from the month's temperature, soil moisture, and rainfall?

To answer the first question, we used a correlation matrix to study the impact of temperature, soil moisture, and rainfall on fire size. The matrix showed a weak positive correlation between average air temperature and fire size, and strong negative correlations between fire size and both soil moisture and rainfall.

library(corrplot)
## corrplot 0.84 loaded
joined4$totalarea=joined4$n*joined4$FIRE_SIZE
cordata=joined4
colnames(cordata)[3]='AvgTemp'
colnames(cordata)[4]='AvgSoilMoisture'
colnames(cordata)[5]='AvgRainfall'
# Columns 3-8: AvgTemp, AvgSoilMoisture, AvgRainfall, FIRE_SIZE, n, totalarea
cor_fire=cor(cordata[3:8])
corrplot(cor_fire, method='number', type='lower', diag=TRUE)

We also studied the frequency of the different fire size classes. California wildfires show a very high frequency of Class A and B fires. These classes are not considered a threat to the environment or nearby properties, since such fires typically burn out quickly. Fires of Class C and above can be dangerous, as the burned area grows fast.

To answer the second question, we looked at plots of California’s fire suppression budget and wildfire count over the years to see if there was anything noteworthy. Below, we can see that there is no strong relationship between the money spent controlling wildfires and the number of wildfires that occur each year.

budge <- read.csv("fire_suppression.csv")
budge$Budget <- as.numeric(gsub("[$,]", "", budge$Budget))
budge$Budget <- budge$Budget / 1000000

firesByYear <- data_wildfire %>% count(Year)
firesBudget <- merge(firesByYear, budge, by="Year")

ggplot(firesBudget, aes(x=Year, y=Budget)) + geom_bar(stat='identity') + ylim(0, max(budge$Budget)) + xlab("Year") +
  ylab("Budget (in Millions $)") + ggtitle("California Wildfire Suppression Budget")

ggplot(firesBudget) + geom_line(aes(x=Year, y=n), stat='identity', group=1) + xlab("Year") + ylab("Frequency") +
  ggtitle("California Fire Frequency")

3. Linear Regression of Wildfires by Cause (Natural or Human-Caused)

First, linear models were used to predict the total number of fires in California, the fire size, and the burn area, with the environmental factors as the independent variables. Separate models were also fit for naturally caused wildfires and human-caused wildfires, to see whether the wildfires' impact differs by cause.

summary_nature=read.csv('summary_nature.csv')
summary_peoplecaused=read.csv('summary_peoplecaused.csv')
# Rename the environmental variables inside each data frame so the lm()
# calls below resolve them from the `data` argument (free-standing vectors
# would be overwritten by the second set of assignments)
summary_nature <- dplyr::rename(summary_nature,
  Avg_Temp = tair_day_livneh_vic,
  Avg_SoilMoisture = soilmoist1_day_livneh_vic,
  Avg_Rainfall = rainfall_day_livneh_vic)
summary_peoplecaused <- dplyr::rename(summary_peoplecaused,
  Avg_Temp = tair_day_livneh_vic,
  Avg_SoilMoisture = soilmoist1_day_livneh_vic,
  Avg_Rainfall = rainfall_day_livneh_vic)

3.1 Model for Predicting the Number of Fires

library(ezids)   # provides xkabledply(), used throughout for model tables
model_nnature <- lm(log(n) ~ Avg_Temp + Avg_SoilMoisture, data = summary_nature)
#summary(model_nnature)
xkabledply(model_nnature, title = paste("Nature Caused:", format(formula(model_nnature))))
Nature Caused: log(n) ~ Avg_Temp + Avg_SoilMoisture

                  Estimate  Std. Error   t value  Pr(>|t|)
(Intercept)         8.1533      0.3724   21.8919         0
Avg_Temp            0.0595      0.0090    6.5931         0
Avg_SoilMoisture   -0.2529      0.0167  -15.1381         0
plot(model_nnature)

## log(n) ~ nature variables - PEOPLE CAUSED (P)
model_npeople <- lm(log(n) ~ Avg_Temp + Avg_SoilMoisture, data = summary_peoplecaused)
#summary(model_npeople)
xkabledply(model_npeople, title = paste("People Caused:", format(formula(model_npeople))))
People Caused: log(n) ~ Avg_Temp + Avg_SoilMoisture

                  Estimate  Std. Error   t value  Pr(>|t|)
(Intercept)         7.2814      0.3131   23.2584         0
Avg_Temp            0.0526      0.0076    6.9257         0
Avg_SoilMoisture   -0.2251      0.0140  -16.0264         0
plot(model_npeople)

The first model, with the log of the total number of fires as the dependent variable, had an adjusted R-squared of 0.87 for naturally caused fires: the environmental factors explain 87% of the total variation in the number of fires. The R-squared value was % for the model where the fires were human-caused. In both models, the coefficients show a direct relationship between air temperature and the number of fires, and an inverse relationship between soil moisture and the number of fires.
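Because the response is log-transformed, the coefficients translate into multiplicative effects on the expected fire count. A back-of-the-envelope conversion, using the nature-caused estimates from the table above:

```r
# With log(n) as the response, a one-unit increase in a predictor multiplies
# the expected fire count by exp(beta). Nature-caused estimates from above:
b_temp <- 0.0595    # Avg_Temp coefficient
b_soil <- -0.2529   # Avg_SoilMoisture coefficient
pct_temp <- (exp(b_temp) - 1) * 100   # percent change per additional degree
pct_soil <- (exp(b_soil) - 1) * 100   # percent change per unit soil moisture
round(c(temperature = pct_temp, soil_moisture = pct_soil), 1)
```

So each additional degree of average air temperature is associated with roughly a 6% increase in the monthly fire count, while each additional unit of soil moisture corresponds to roughly a 22% decrease.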

3.2 Linear Model for Predicting Fire Size

## FIRE_SIZE ~ nature variables       - NATURE CAUSED (N)
model_sizenature <- lm(log(FIRE_SIZE) ~ Avg_Temp + Avg_SoilMoisture , data = summary_nature)
#summary(model_sizenature)
xkabledply(model_sizenature, title = paste("Nature Caused:",format(formula(model_sizenature)) ))
Nature Caused: log(FIRE_SIZE) ~ Avg_Temp + Avg_SoilMoisture

                  Estimate  Std. Error   t value  Pr(>|t|)
(Intercept)         7.5705      1.0788    7.0177    0.0000
Avg_Temp            0.0291      0.0261    1.1144    0.2661
Avg_SoilMoisture   -0.3783      0.0484   -7.8172    0.0000
plot(model_sizenature)

model_sizepeople <- lm(log(FIRE_SIZE) ~ Avg_Temp + Avg_SoilMoisture , data = summary_peoplecaused)
#summary(model_sizepeople)
xkabledply(model_sizepeople, title = paste("People Caused:",format(formula(model_sizepeople) )))
People Caused: log(FIRE_SIZE) ~ Avg_Temp + Avg_SoilMoisture

                  Estimate  Std. Error   t value  Pr(>|t|)
(Intercept)         4.8934      1.1114    4.4029    0.0000
Avg_Temp            0.0756      0.0269    2.8048    0.0054
Avg_SoilMoisture   -0.2687      0.0499   -5.3907    0.0000
plot(model_sizepeople)

Next, we modeled the impact of the environmental factors on fire size. The environmental factors explain % of the variation in fire size in the naturally caused fires model, whereas only % of the variation could be explained in the human-caused fires model. The coefficient signs match the number-of-fires models, although air temperature was not statistically significant (p = 0.27) for naturally caused fires.

3.3 Model for Predicting Burning Area

model_areanature <- lm(log(FIRE_SIZE*n) ~ Avg_Temp + Avg_SoilMoisture , data = summary_nature)
#summary(model_areanature)
xkabledply(model_areanature, title = paste("Nature Caused:",format(formula(model_areanature)) ))
Nature Caused: log(FIRE_SIZE * n) ~ Avg_Temp + Avg_SoilMoisture

                  Estimate  Std. Error   t value  Pr(>|t|)
(Intercept)        15.7238      1.1627   13.5237    0.0000
Avg_Temp            0.0887      0.0282    3.1459    0.0018
Avg_SoilMoisture   -0.6312      0.0522  -12.1022    0.0000
plot(model_areanature)

model_areapeople <- lm(log(FIRE_SIZE*n) ~ Avg_Temp + Avg_SoilMoisture , data = summary_peoplecaused)
#summary(model_areapeople)
xkabledply(model_areapeople, title = paste("People Caused:",format(formula(model_areapeople) )))
People Caused: log(FIRE_SIZE * n) ~ Avg_Temp + Avg_SoilMoisture

                  Estimate  Std. Error   t value  Pr(>|t|)
(Intercept)        12.1748      1.2004   10.1423         0
Avg_Temp            0.1281      0.0291    4.4031         0
Avg_SoilMoisture   -0.4938      0.0538   -9.1707         0
plot(model_areapeople)

Finally, we looked at how the environmental factors affect the burn area. They explain % of the variation in burn area in the naturally caused fires model and % in the human-caused fires model. Again, the coefficient signs stayed consistent with the previous models.
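As a sanity check: since log(FIRE_SIZE * n) = log(FIRE_SIZE) + log(n) and least squares is linear in the response, the burn-area coefficients should equal the sums of the corresponding fire-size and fire-count coefficients, assuming all three models were fit on the same rows. Checking against the nature-caused tables above:

```r
# OLS on a summed response yields summed coefficients, so each burn-area
# estimate should match the sum of the two earlier models' estimates.
temp_sum <- 0.0595 + 0.0291     # compare with 0.0887 in the burn-area table
soil_sum <- -0.2529 + -0.3783   # compare with -0.6312 in the burn-area table
round(c(temp_sum = temp_sum, soil_sum = soil_sum), 4)
```

The soil-moisture sum matches the burn-area table exactly, and the temperature sum agrees to rounding, confirming the three nature-caused models are internally consistent.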

Comparing models by cause, the relationships between the environmental variables and the fire outcomes remained consistent whether a wildfire started naturally or was human-caused, suggesting that both kinds of wildfires respond similarly to environmental conditions.

4. Predicting Large Fires (Class C and Above)

Through EDA, we found that Class A and B fires occur with very high frequency in California; such fires are observed nearly every day, are self-limiting, and burn out quickly. Fires of Class C or above are observed far less frequently but can burn over a much larger area, threatening the environment, wildlife, and people living nearby, and are therefore considered dangerous as the burned area grows fast. This section accordingly focuses on larger fires. Class C or larger fires were observed on 4,410 days in California between 1992 and 2015. We built a logistic (logit) regression model that predicts the probability of a large wildfire on a given day from the average temperature, soil moisture, and rainfall.

In the dataset, we labeled days with no large-fire observation as 0 and days with at least one as 1. During training, we tried several combinations of input features and found that adding rainfall to a model already containing temperature and soil moisture did not improve it.

library(dplyr)
data_wildfire=read.csv('final_wildfire.csv')
data_large=subset(data_wildfire, FIRE_SIZE_CLASS != 'A' & FIRE_SIZE_CLASS != 'B')
temp3=data_large %>% count(Year, DISCOVERY_DOY)
rainfall <- read.csv("rainfall_daily.csv")
soilmoisture <- read.csv("soilmoisture_daily.csv")
airtemp <- read.csv("airtemp_daily.csv")
airtemp$month <- strftime(airtemp$time, "%m")
airtemp$year <- strftime(airtemp$time, "%Y")
airtemp$DOY <- strftime(airtemp$time, "%j")
rainfall$year <- strftime(rainfall$time, "%Y")
rainfall$DOY <- strftime(rainfall$time, "%j")

cleaned_rainfall <- subset(rainfall, select = -c(time))
str(airtemp)
## 'data.frame':    23376 obs. of  5 variables:
##  $ time               : chr  "1950-01-01 00:00:00+00:00" "1950-01-02 00:00:00+00:00" "1950-01-03 00:00:00+00:00" "1950-01-04 00:00:00+00:00" ...
##  $ tair_day_livneh_vic: num  4.14 1.77 -3.09 -3.59 -2.75 ...
##  $ month              : chr  "01" "01" "01" "01" ...
##  $ year               : chr  "1950" "1950" "1950" "1950" ...
##  $ DOY                : chr  "001" "002" "003" "004" ...
cleaned_airtemp <- subset(airtemp, select = -c(time))
soilmoisture$year <- strftime(soilmoisture$time, "%Y")
soilmoisture$DOY <- strftime(soilmoisture$time, "%j")

cleaned_soilmoisture <- subset(soilmoisture, select = -c(time))
joined1 <- merge(cleaned_airtemp, cleaned_soilmoisture, by.x=c("year",'DOY'), by.y=c("year",'DOY'))
joined2 <- merge(joined1, cleaned_rainfall, by.x=c("year",'DOY'), by.y=c("year",'DOY'))
joined2$year=as.integer(joined2$year)
joined2$DOY=as.integer(joined2$DOY)
str(joined2)
## 'data.frame':    23376 obs. of  6 variables:
##  $ year                     : int  1950 1950 1950 1950 1950 1950 1950 1950 1950 1950 ...
##  $ DOY                      : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ tair_day_livneh_vic      : num  4.14 1.77 -3.09 -3.59 -2.75 ...
##  $ month                    : chr  "01" "01" "01" "01" ...
##  $ soilmoist1_day_livneh_vic: num  18.3 18.5 18.3 18.3 18.1 ...
##  $ rainfall_day_livneh_vic  : num  0.67193 0.79325 0.07883 0.08991 0.00174 ...
joined2=subset(joined2, year>= 1992 & year<= 2015)
joined5=joined2
joined5$DOY=joined5$DOY + 1   # pair each day's weather with the next day's fires


data_large=merge(joined2,temp3, by.x=c('year','DOY'),by.y=c('Year','DISCOVERY_DOY'),all.x=T)
data_large$n[is.na(data_large$n)]=0
data_large$fire=1
data_large$fire[data_large$n==0]=0
str(data_large)
## 'data.frame':    8036 obs. of  8 variables:
##  $ year                     : int  1992 1992 1992 1992 1992 1992 1992 1992 1992 1992 ...
##  $ DOY                      : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ tair_day_livneh_vic      : num  3.78 4.17 4.19 4.87 4.94 ...
##  $ month                    : chr  "01" "01" "01" "01" ...
##  $ soilmoist1_day_livneh_vic: num  20.2 20.5 21.4 23.6 25.1 ...
##  $ rainfall_day_livneh_vic  : num  0.0616 1.1337 3.0754 7.0797 10.8035 ...
##  $ n                        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ fire                     : num  0 0 0 0 0 0 0 0 0 0 ...
data_large2=merge(joined5,temp3, by.x=c('year','DOY'),by.y=c('Year','DISCOVERY_DOY'),all.x=T)
data_large2$n[is.na(data_large2$n)]=0
data_large2$fire=1
data_large2$fire[data_large2$n==0]=0

# Because DOY was shifted forward by one, the last (now out-of-range) day
# of each year rolls over to day 1 of the following year
for (i in 1992:2015){
  tdate <- max(subset(data_large2, year == i)$DOY)
  idx <- data_large2$year == i & data_large2$DOY == tdate
  data_large2$year[idx] <- i + 1
  data_large2$DOY[idx] <- 1
}
str(data_large2)
## 'data.frame':    8036 obs. of  8 variables:
##  $ year                     : num  1992 1992 1992 1992 1992 ...
##  $ DOY                      : num  2 3 4 5 6 7 8 9 10 11 ...
##  $ tair_day_livneh_vic      : num  3.78 4.17 4.19 4.87 4.94 ...
##  $ month                    : chr  "01" "01" "01" "01" ...
##  $ soilmoist1_day_livneh_vic: num  20.2 20.5 21.4 23.6 25.1 ...
##  $ rainfall_day_livneh_vic  : num  0.0616 1.1337 3.0754 7.0797 10.8035 ...
##  $ n                        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ fire                     : num  0 0 0 0 0 0 0 0 0 0 ...

4.1 Large Fire Prediction Model via Logistic Regression

model_large=glm(fire~tair_day_livneh_vic+soilmoist1_day_livneh_vic,data=data_large,family=binomial())
#summary(model_large)
xkabledply(model_large, title = paste("Logit Regression :",format(formula(model_large)) ))
Logit Regression : fire ~ tair_day_livneh_vic + soilmoist1_day_livneh_vic

                           Estimate  Std. Error   z value  Pr(>|z|)
(Intercept)                  3.0691      0.3117    9.8458         0
tair_day_livneh_vic          0.1461      0.0080   18.1739         0
soilmoist1_day_livneh_vic   -0.3157      0.0145  -21.7253         0
library('pROC')
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var
prob=predict(model_large, type = "response" )
data_large$prob=prob
h = roc(fire~prob, data=data_large)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
auc(h) 
## Area under the curve: 0.9017
plot(h)

library("regclass")
## Loading required package: bestglm
## Loading required package: leaps
## Loading required package: VGAM
## Loading required package: stats4
## Loading required package: splines
## 
## Attaching package: 'VGAM'
## The following object is masked from 'package:caret':
## 
##     predictors
## The following object is masked from 'package:car':
## 
##     logit
## Loading required package: rpart
## Loading required package: randomForest
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
## 
##     combine
## The following object is masked from 'package:ggplot2':
## 
##     margin
## Important regclass change from 1.3:
## All functions that had a . in the name now have an _
## all.correlations -> all_correlations, cor.demo -> cor_demo, etc.
## 
## Attaching package: 'regclass'
## The following object is masked from 'package:lattice':
## 
##     qq
xkabledply( confusion_matrix(model_large), title = "Confusion matrix for large fire predictor" )
Confusion matrix for large fire predictor

           Predicted 0  Predicted 1  Total
Actual 0          3265          659   3924
Actual 1           692         3420   4112
Total             3957         4079   8036

The model with temperature and soil moisture achieved an AUC of 0.9017. While that is high enough to predict the chance of a large fire, the model requires same-day environmental measurements, so it cannot forecast: it would be more convenient to use daily data to predict whether any fires will occur the next day. To increase the model's usability, we therefore adjusted the dataset to pair each day's fire outcome with the previous day's environmental observations, and retrained the logit model on this adjusted dataset.
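An equivalent, perhaps simpler way to pair each day's fire outcome with the previous day's weather (instead of incrementing DOY and re-merging) is to lag the predictor columns on a date-sorted frame. A base-R sketch on a toy frame (the values and the `temp_prev` column name below are illustrative, not the report's data):

```r
# Toy daily frame standing in for the joined weather/fire data
daily <- data.frame(
  date = as.Date("1992-01-01") + 0:4,
  temp = c(3.8, 4.2, 4.2, 4.9, 4.9),
  fire = c(0, 0, 1, 0, 1))

daily <- daily[order(daily$date), ]             # ensure chronological order
daily$temp_prev <- c(NA, head(daily$temp, -1))  # previous day's temperature
lagged <- daily[!is.na(daily$temp_prev), ]      # drop the first day (no lag)
lagged[, c("date", "temp_prev", "fire")]
```

A logit model fit on `fire ~ temp_prev` then predicts each day's large-fire probability from the previous day's observed weather.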

model_large=glm(fire~tair_day_livneh_vic+soilmoist1_day_livneh_vic+rainfall_day_livneh_vic,data=data_large2,family=binomial())
#summary(model_large)
xkabledply(model_large, title = paste("Predicting Probability :", format(formula(model_large))))
Predicting Probability : fire ~ tair_day_livneh_vic + soilmoist1_day_livneh_vic + rainfall_day_livneh_vic

                           Estimate  Std. Error   z value  Pr(>|z|)
(Intercept)                  2.4707      0.3245    7.6148         0
tair_day_livneh_vic          0.1508      0.0081   18.6813         0
soilmoist1_day_livneh_vic   -0.2743      0.0158  -17.3927         0
rainfall_day_livneh_vic     -0.1312      0.0256   -5.1318         0
library('pROC')
prob=predict(model_large, type = "response" )
data_large2$prob=prob
h = roc(fire~prob, data=data_large2)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
auc(h) 
## Area under the curve: 0.9002
plot(h)

library("regclass")
library("ezids")
xkabledply( confusion_matrix(model_large), title = "Confusion matrix for large fire predictor" )
Confusion matrix for large fire predictor

           Predicted 0  Predicted 1  Total
Actual 0          3244          680   3924
Actual 1           689         3423   4112
Total             3933         4103   8036
model_large=glm(fire~tair_day_livneh_vic,data=data_large2,family=binomial())
#summary(model_large)
xkabledply(model_large, title = paste("Logit Regression :", format(formula(model_large))))
Logit Regression : fire ~ tair_day_livneh_vic

                      Estimate  Std. Error   z value  Pr(>|z|)
(Intercept)            -3.7380      0.0821  -45.5091         0
tair_day_livneh_vic     0.2833      0.0058   48.6031         0
library('pROC')
prob=predict(model_large, type = "response" )
data_large2$prob=prob
h = roc(fire~prob, data=data_large2)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
auc(h) 
## Area under the curve: 0.8799
plot(h)

library("regclass")
library("ezids")
xkabledply( confusion_matrix(model_large), title = "Confusion matrix for large fire predictor" )
Confusion matrix for large fire predictor

           Predicted 0  Predicted 1  Total
Actual 0          3222          702   3924
Actual 1           780         3332   4112
Total             4002         4034   8036

The next-day model with temperature, soil moisture, and rainfall achieved an AUC of 0.9002, close to the same-day model. Next, we simplified the model by reducing the features required for prediction: a temperature-only model still reached an AUC of 0.8799, making it useful for next-day prediction with less data. In the confusion matrices, the model with temperature, soil moisture, and rainfall made 689 Type II errors out of 8,036 total days, while the temperature-only model made 780. This suggests the model can predict the next day's situation with no significant loss in AUC.
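To make the temperature-only model concrete, exponentiating its slope gives an odds ratio, and plogis() converts the linear predictor to a probability. A sketch using the estimates from the table above (the 20 C input is an arbitrary illustrative value):

```r
# Estimates from the temperature-only logit model above
b0 <- -3.7380
b1 <- 0.2833
odds_ratio <- exp(b1)            # odds multiplier per additional degree
p_20C <- plogis(b0 + b1 * 20)    # predicted large-fire probability at 20 C
round(c(odds_ratio = odds_ratio, p_20C = p_20C), 3)
```

Each additional degree of average air temperature multiplies the odds of a large-fire day by about 1.33.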

4.2 Multinomial Regression Models

Next, we used multinomial regression to predict characteristics of wildfires based on the month's environmental conditions. Such a model could help California fire response teams allocate resources whenever conditions put a location at risk for a large wildfire, improving response time and the efficiency of resource allocation.

data_wildfire$STAT_CAUSE_CODE <- as.factor(data_wildfire$STAT_CAUSE_CODE)
data_wildfire$FIRE_SIZE_CLASS <- as.factor(data_wildfire$FIRE_SIZE_CLASS)
split <- createDataPartition(data_wildfire$FIRE_SIZE_CLASS, p = .70, list = FALSE)
train <- data_wildfire[split,]
test <- data_wildfire[-split,]

For these models, the independent variables were rainfall, air temperature, and soil moisture, and the dependent variables of interest were two categorical variables: fire cause and fire size class. The cause-prediction model produced a low accuracy of 30%. On closer inspection, the low score makes sense: among the 13 causes, most wildfires fell under three, with "miscellaneous" the most common (27% of records), followed by lightning (14%) and debris burning (7%). The remaining causes were far less frequent, which biased the model toward predicting that fires were caused by lightning, debris burning, or "miscellaneous" reasons.

# Note: model1 is fit on the full dataset rather than `train`, so the
# train/test accuracies below are not a true out-of-sample evaluation
model1 <- multinom(FIRE_SIZE_CLASS ~ month + tair_day_livneh_vic + 
                     soilmoist1_day_livneh_vic + 
                     rainfall_day_livneh_vic, data = data_wildfire)
## # weights:  42 (30 variable)
## initial  value 341878.899998 
## iter  10 value 188810.654841
## iter  20 value 186552.059750
## iter  30 value 172788.345057
## iter  40 value 172120.409267
## iter  50 value 172097.588032
## iter  50 value 172097.586927
## iter  50 value 172097.586927
## final  value 172097.586927 
## converged
summary(model1)
## Call:
## multinom(formula = FIRE_SIZE_CLASS ~ month + tair_day_livneh_vic + 
##     soilmoist1_day_livneh_vic + rainfall_day_livneh_vic, data = data_wildfire)
## 
## Coefficients:
##   (Intercept)        month tair_day_livneh_vic soilmoist1_day_livneh_vic
## B    0.759388 -0.024553195         -0.01066692               -0.04254142
## C   -1.743702 -0.011366345          0.01805757               -0.05708878
## D   -3.824299  0.002144915          0.03116456               -0.04269029
## E   -3.893592 -0.050285175          0.05062160               -0.08665472
## F   -6.040187 -0.038727088          0.09023196               -0.02059583
## G   -8.346085  0.142149277          0.11367589               -0.03308442
##   rainfall_day_livneh_vic
## B             -0.09605759
## C             -0.11932628
## D             -0.10537697
## E             -0.02901490
## F             -0.15492638
## G             -0.53080612
## 
## Std. Errors:
##   (Intercept)       month tair_day_livneh_vic soilmoist1_day_livneh_vic
## B  0.08356935 0.002605480         0.001487801               0.003966467
## C  0.19564010 0.006387345         0.003407768               0.009356847
## D  0.41304804 0.013663305         0.007203957               0.019684066
## E  0.58072831 0.020076602         0.009997365               0.028210234
## F  0.75102213 0.026845046         0.013165733               0.035979281
## G  1.14104812 0.041885212         0.019792929               0.054126569
##   rainfall_day_livneh_vic
## B             0.006644228
## C             0.018048519
## D             0.037504986
## E             0.049045297
## F             0.075269686
## G             0.157241242
## 
## Residual Deviance: 344195.2 
## AIC: 344255.2
# Predicting the values for train dataset
train$predictions <- as.factor(predict(model1, newdata = train, "class"))

# Building classification table
tab <- table(train$FIRE_SIZE_CLASS, train$predictions)

# Calculating accuracy - sum of diagonal elements divided by total obs
round((sum(diag(tab))/sum(tab))*100,2)
## [1] 50.84
cm2 <- confusionMatrix(train$predictions, train$FIRE_SIZE_CLASS)

cm2df <- as.data.frame(cm2$table)
cm2df$Prediction <- factor(cm2df$Prediction, levels=rev(levels(cm2df$Prediction)))

ggplot(cm2df, aes(Prediction,Reference, fill= Freq)) +
        geom_tile() + geom_text(aes(label=Freq)) +
        scale_fill_gradient(low="white", high="#009194") +
        labs(x = "Reference",y = "Prediction", title="Confusion Matrix for Predicted Fire Sizes")

# Predicting the values for test dataset
test$predictions <- as.factor(predict(model1, newdata = test, "class"))

# Building classification table
tab <- table(test$FIRE_SIZE_CLASS, test$predictions)

# Calculating accuracy - sum of diagonal elements divided by total obs
round((sum(diag(tab))/sum(tab))*100,2)
## [1] 50.85
# Creating  confusion matrix
cm2 <- confusionMatrix(test$predictions, test$FIRE_SIZE_CLASS)

cm2df <- as.data.frame(cm2$table)
cm2df$Prediction <- factor(cm2df$Prediction, levels=rev(levels(cm2df$Prediction)))

# Plotting confusion matrix
ggplot(cm2df, aes(Prediction,Reference, fill= Freq)) +
        geom_tile() + geom_text(aes(label=Freq)) +
        scale_fill_gradient(low="white", high="#009194") +
        labs(x = "Reference",y = "Prediction", title="Confusion Matrix for Predicted Fire Sizes")

With the next model, we predicted which class of fire would occur given the month's environmental conditions, i.e., how big a fire would be if one happened. This model achieved 50.9% accuracy, which was also lower than expected. On further inspection, it suffered from the same problem as the previous model: a large class imbalance. Class A and B fires, the smallest classes, account for 90.87% of fires in the dataset; as a result, the model predicted Class A for almost every fire in the training set, producing predictions that were not very useful.
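The accuracy paradox above can be demonstrated with a toy confusion matrix (the counts below are illustrative, not from the fitted model): a classifier that almost always predicts the majority class still posts high overall accuracy while recall on the rare class collapses.

```r
# Rows = actual class, columns = predicted class (illustrative counts)
cm <- matrix(c(900,  5,    # actual "A": almost always predicted "A"
                90,  5),   # actual "C": mostly mispredicted as "A"
             nrow = 2, byrow = TRUE,
             dimnames = list(actual = c("A", "C"),
                             predicted = c("A", "C")))
accuracy <- sum(diag(cm)) / sum(cm)        # high despite ignoring class "C"
recall_C <- cm["C", "C"] / sum(cm["C", ])  # recall for the rare class
round(c(accuracy = accuracy, recall_C = recall_C), 3)
```

This is why per-class metrics, like the heat-mapped confusion matrices above, are more informative than raw accuracy on imbalanced data.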

model2 <- multinom(STAT_CAUSE_CODE ~ tair_day_livneh_vic + 
                     soilmoist1_day_livneh_vic + 
                     rainfall_day_livneh_vic, data = train)
## # weights:  65 (48 variable)
## initial  value 315288.704919 
## iter  10 value 255132.024941
## iter  20 value 253739.869288
## iter  30 value 252422.805898
## iter  40 value 250092.813541
## iter  50 value 237750.434312
## iter  60 value 237180.495739
## iter  70 value 237172.061944
## iter  80 value 237168.735350
## iter  90 value 237167.524407
## final  value 237167.483934 
## converged
summary(model2)
## Call:
## multinom(formula = STAT_CAUSE_CODE ~ tair_day_livneh_vic + soilmoist1_day_livneh_vic + 
##     rainfall_day_livneh_vic, data = train)
## 
## Coefficients:
##    (Intercept) tair_day_livneh_vic soilmoist1_day_livneh_vic
## 2    5.7219418          -0.2431706              0.0015705764
## 3    4.3744351          -0.2513196             -0.0290211195
## 4    4.8835304          -0.2660508             -0.0156773862
## 5    4.0824678          -0.3165747              0.1387873718
## 6    2.8640188          -0.2447251             -0.0821090792
## 7    5.0954564          -0.2529102              0.0081553514
## 8    3.7769386          -0.2682738              0.0543729277
## 9    6.4474037          -0.2678762             -0.0003845994
## 10  -2.9095860          -0.1300278              0.1046886650
## 11   0.9586907          -0.2487532              0.0993754481
## 12  -0.5305952          -0.2614397              0.0517155962
## 13   2.0881075          -0.2348208              0.1622823738
##    rainfall_day_livneh_vic
## 2               -0.5116203
## 3               -0.5677801
## 4               -0.3918518
## 5               -0.6264776
## 6               -0.5990995
## 7               -0.5268377
## 8               -0.5830283
## 9               -0.5026156
## 10              -1.6470925
## 11              -0.4153789
## 12              -0.6556101
## 13              -0.6672742
## 
## Std. Errors:
##    (Intercept) tair_day_livneh_vic soilmoist1_day_livneh_vic
## 2    0.1718016         0.003816083               0.009415995
## 3    0.2669450         0.005761066               0.014480841
## 4    0.2251021         0.004922530               0.012159074
## 5    0.1979615         0.004425727               0.010560394
## 6    0.6594055         0.013838958               0.036393408
## 7    0.1919473         0.004234352               0.010435798
## 8    0.2426104         0.005318668               0.012902304
## 9    0.1671194         0.003728101               0.009164372
## 10   1.3957371         0.031108746               0.072403893
## 11   0.4861194         0.010666612               0.024839970
## 12   1.6132911         0.034674852               0.083828553
## 13   0.2214414         0.004919235               0.011630885
##    rainfall_day_livneh_vic
## 2               0.01418935
## 3               0.02895691
## 4               0.01796642
## 5               0.01655305
## 6               0.08753391
## 7               0.01694069
## 8               0.02335467
## 9               0.01322479
## 10              0.34932358
## 11              0.03666006
## 12              0.19604528
## 13              0.02094008
## 
## Residual Deviance: 474335 
## AIC: 474431
# Generate predictions on the training set
train$predictions <- predict(model2, newdata = train, "class")

# Build a confusion matrix from the training-set predictions
cm <- confusionMatrix(train$predictions, train$STAT_CAUSE_CODE)
cmdf <- as.data.frame(cm$table)
cmdf$Prediction <- factor(cmdf$Prediction, levels=rev(levels(cmdf$Prediction)))
tab <- table(train$STAT_CAUSE_CODE, train$predictions)
round((sum(diag(tab))/sum(tab))*100,2)
## [1] 29.89
# Plot the confusion matrix
ggplot(cmdf, aes(Reference,Prediction, fill= Freq)) +
        geom_tile() + geom_text(aes(label=Freq)) +
        scale_fill_gradient(low="white", high="#009194") +
        labs(x = "Reference",y = "Prediction", title="Confusion Matrix for Predicted Fire Causes")

# Use the testing set to test our model
test$predictions <- predict(model2, newdata = test, "class")
tab <- table(test$STAT_CAUSE_CODE, test$predictions)
round((sum(diag(tab))/sum(tab))*100,2)
## [1] 29.87
# Create the confusion matrix
cm <- confusionMatrix(test$predictions, test$STAT_CAUSE_CODE)

cmdf <- as.data.frame(cm$table)
cmdf$Prediction <- factor(cmdf$Prediction, levels=rev(levels(cmdf$Prediction)))
# Plot the confusion matrix
ggplot(cmdf, aes(Reference,Prediction, fill= Freq)) +
        geom_tile() + geom_text(aes(label=Freq)) +
        scale_fill_gradient(low="white", high="#009194") +
        labs(x = "Reference",y = "Prediction", title="Confusion Matrix for Predicted Fire Causes")

5. Long-Term Estimation of Wildfire Cases, Fire Size, and Burned Area

5.1 Predictions of wildfire cases, fire size, and burned area

We also wanted to evaluate the long-term effect of natural factors on wildfires. We tried to predict the number of wildfire cases, the average fire size, and the total burned area from natural factors.

From the correlation plot in the EDA section, we found that the number of cases has a strong correlation with the environmental variables, while fire size and total area burned have some correlation with them. We first tried linear models on the raw response variables, but the results suggested that the models did not fit well. After taking the log of the response variables, the models improved.
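The improvement from the log transform can be verified by fitting both versions and comparing r-squared; a sketch assuming the monthly data frame joined4 assembled below:

```r
# Fit the fire-count response on the raw and the log scale
fit_raw <- lm(n ~ tair_day_livneh_vic + soilmoist1_day_livneh_vic,
              data = joined4)
fit_log <- lm(log(n) ~ tair_day_livneh_vic + soilmoist1_day_livneh_vic,
              data = joined4)

# A markedly higher r-squared on the log scale supports the transform
c(raw = summary(fit_raw)$r.squared, log = summary(fit_log)$r.squared)
```

The residual plots (`plot(fit_raw)` vs `plot(fit_log)`) tell the same story on the assumption checks.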

#Total Fire Cases
model_case = lm(log(n) ~ tair_day_livneh_vic + soilmoist1_day_livneh_vic, data = joined4)
#summary(model_case)
xkabledply(model_case, title = paste("Linear Regression :", format(formula(model_case))))
Linear Regression : log(n) ~ tair_day_livneh_vic + soilmoist1_day_livneh_vic
                           Estimate  Std. Error   t value  Pr(>|t|)
(Intercept)                  8.8261      0.2806   31.4571  0
tair_day_livneh_vic          0.0526      0.0068    7.7415  0
soilmoist1_day_livneh_vic   -0.2322      0.0126  -18.4468  0
VIF(model_case)
##       tair_day_livneh_vic soilmoist1_day_livneh_vic 
##                  3.653691                  3.653691
plot(model_case)

#Average Fire Size
model_avgsize = lm(log(FIRE_SIZE) ~ tair_day_livneh_vic + soilmoist1_day_livneh_vic, data = joined4)
#summary(model_avgsize)
xkabledply(model_avgsize, title = paste("Linear Regression :", format(formula(model_avgsize))))
Linear Regression : log(FIRE_SIZE) ~ tair_day_livneh_vic + soilmoist1_day_livneh_vic
                           Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)                  6.4512      0.8955   7.2041  0.0000
tair_day_livneh_vic          0.0409      0.0217   1.8858  0.0604
soilmoist1_day_livneh_vic   -0.2994      0.0402  -7.4540  0.0000
VIF(model_avgsize)
##       tair_day_livneh_vic soilmoist1_day_livneh_vic 
##                  3.653691                  3.653691
plot(model_avgsize)

#Total Fire Area
model_totalarea = lm(log(totalarea) ~ tair_day_livneh_vic + soilmoist1_day_livneh_vic, data = joined4)
#summary(model_totalarea)
xkabledply(model_totalarea, title = paste("Linear Regression :", format(formula(model_totalarea))))
Linear Regression : log(totalarea) ~ tair_day_livneh_vic + soilmoist1_day_livneh_vic
                           Estimate  Std. Error   t value  Pr(>|t|)
(Intercept)                 15.2773      0.9527   16.0351  0e+00
tair_day_livneh_vic          0.0936      0.0231    4.0523  1e-04
soilmoist1_day_livneh_vic   -0.5316      0.0427  -12.4385  0e+00
VIF(model_totalarea)
##       tair_day_livneh_vic soilmoist1_day_livneh_vic 
##                  3.653691                  3.653691
plot(model_totalarea)

The r-squared value is 0.9 for the case model, 0.5377 for the average fire size model, and 0.7825 for the total burned area model. From the diagnostic plots we found that the case and burned area models satisfy the linear model assumptions well.
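The r-squared values quoted above can be read straight from the fitted objects (the summary() calls are commented out in the code):

```r
# Pull r-squared from each of the three fitted models
sapply(list(cases      = model_case,
            avg_size   = model_avgsize,
            total_area = model_totalarea),
       function(m) summary(m)$r.squared)
```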

# Monthly averages of each environmental variable
temp = aggregate(tair_day_livneh_vic ~ Year + month, data = data_wildfire, mean)
temp1 = aggregate(soilmoist1_day_livneh_vic ~ Year + month, data = data_wildfire, mean)
temp2 = aggregate(rainfall_day_livneh_vic ~ Year + month, data = data_wildfire, mean)

# Merge the three monthly series into one data frame
joined3 = merge(temp, temp1, by = c('Year', 'month'))
joined3 = merge(joined3, temp2, by = c('Year', 'month'))

# Keep only large fires (class C and above), then compute the monthly
# mean fire size and the monthly fire count
temp1 = subset(data_wildfire, FIRE_SIZE_CLASS != 'A' & FIRE_SIZE_CLASS != 'B')
temp2 = aggregate(FIRE_SIZE ~ Year + month, data = temp1, mean)
temp1 = temp1 %>% count(Year, month)

# Combine the fire statistics with the environmental variables
joined4 = merge(temp2, temp1, by = c('Year', 'month'))
joined4 = merge(joined3, joined4, by = c('Year', 'month'))

5.2 Long-term large wildfire prediction and analysis

We built models to predict the number of large wildfires, their average fire size, and the total burned area.

library(corrplot)
# Total burned area = number of large fires x average fire size
joined4$totalarea = joined4$n * joined4$FIRE_SIZE
cor_fire = cor(joined4[c(3, 4, 5, 6, 7, 8)])
corrplot(cor_fire, method = 'number', type = 'lower', diag = TRUE)

model_case = lm(log(n) ~ tair_day_livneh_vic + soilmoist1_day_livneh_vic, data = joined4)
#summary(model_case)
xkabledply(model_case, title = paste("Case Model:", format(formula(model_case))))
Case Model: log(n) ~ tair_day_livneh_vic + soilmoist1_day_livneh_vic
                           Estimate  Std. Error   t value  Pr(>|t|)
(Intercept)                  6.1644      0.4773   12.9163  0
tair_day_livneh_vic          0.0904      0.0112    8.0897  0
soilmoist1_day_livneh_vic   -0.2847      0.0219  -12.9789  0
plot(model_case)

VIF(model_case)
##       tair_day_livneh_vic soilmoist1_day_livneh_vic 
##                  3.733188                  3.733188
model_avgsize = lm(log(FIRE_SIZE) ~ tair_day_livneh_vic + soilmoist1_day_livneh_vic, data = joined4)
#summary(model_avgsize)
xkabledply(model_avgsize, title = paste("Avg. Size Model:", format(formula(model_avgsize))))
Avg. Size Model: log(FIRE_SIZE) ~ tair_day_livneh_vic + soilmoist1_day_livneh_vic
                           Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)                  7.4492      0.9322   7.9908  0.0000
tair_day_livneh_vic          0.0407      0.0218   1.8642  0.0635
soilmoist1_day_livneh_vic   -0.1770      0.0428  -4.1300  0.0000
plot(model_avgsize)

VIF(model_avgsize)
##       tair_day_livneh_vic soilmoist1_day_livneh_vic 
##                  3.733188                  3.733188
model_totalarea = lm(log(totalarea) ~ tair_day_livneh_vic + soilmoist1_day_livneh_vic, data = joined4)
#summary(model_totalarea)
xkabledply(model_totalarea, title = paste("Burn Area Model:", format(formula(model_totalarea))))
Burn Area Model: log(totalarea) ~ tair_day_livneh_vic + soilmoist1_day_livneh_vic
                           Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)                 13.6135      1.1019  12.3546  0
tair_day_livneh_vic          0.1310      0.0258   5.0809  0
soilmoist1_day_livneh_vic   -0.4616      0.0506  -9.1154  0
plot(model_totalarea)

VIF(model_totalarea)
##       tair_day_livneh_vic soilmoist1_day_livneh_vic 
##                  3.733188                  3.733188

The r-squared value is 0.8616 for the case model, 0.3366 for the average fire size model, and 0.7391 for the total burned area model. From the diagnostic plots we found that the large fire case model does not fit well, as the residual variance is not constant, and the Q-Q plot for the average large fire size model shows some lack of normality.
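The normality check mentioned above can be reproduced directly on the residuals; a short sketch (the Shapiro-Wilk test is our addition, not part of the original analysis, and it accepts at most 5000 observations):

```r
# Q-Q plot of the average-size model residuals, plus a formal test
r <- residuals(model_avgsize)
qqnorm(r)
qqline(r)
shapiro.test(r)  # a small p-value indicates departure from normality
```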

6. Decision Tree Model for Multiclass Classification

6.1 Predicting all fire classes

#install.packages("caret")
library(rpart)
library(rpart.plot)
library(caret)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ tibble  3.1.6     ✓ purrr   0.3.4
## ✓ tidyr   1.1.4     ✓ stringr 1.4.0
## ✓ readr   1.4.0     ✓ forcats 0.5.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x randomForest::combine() masks dplyr::combine()
## x tidyr::fill()           masks VGAM::fill()
## x dplyr::filter()         masks stats::filter()
## x dplyr::lag()            masks stats::lag()
## x purrr::lift()           masks caret::lift()
## x randomForest::margin()  masks ggplot2::margin()
## x dplyr::recode()         masks car::recode()
## x purrr::some()           masks car::some()
## x readr::spec()           masks yardstick::spec()

A decision tree model was created for multiclass classification to predict the class of future wildfires. First, the insignificant variables were dropped from the data frame. The variables used to train the model were “Avg_SoilMoisture”, “Avg_Temp”, “Avg_Rainfall”, “STAT_CAUSE_CODE”, and “FIRE_SIZE_CLASS”. This model (4.1) was built without using any tuning parameters.

These models showed some promise in predicting certain classes of fires. The first model, which tried to predict all fire classes, did not perform especially well: its accuracy of 52 percent is only slightly better than chance. Accuracy was not the primary target of this model, however. The interest in these models comes from their structure: as they are tweaked and developed further, there are many parameters and model types to work with, which became evident when we tuned the model and gained accuracy. The model structure follows.

## CART 
## 
## 151640 samples
##      4 predictor
##      7 classes: 'A', 'B', 'C', 'D', 'E', 'F', 'G' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 121312, 121313, 121313, 121312, 121310 
## Resampling results across tuning parameters:
## 
##   maxdepth  Accuracy   Kappa     
##    1        0.5242878  0.07930881
##    3        0.5242878  0.07930881
##    5        0.5242878  0.07930881
##    7        0.5242878  0.07930881
##    9        0.5242878  0.07930881
##   10        0.5242878  0.07930881
##   11        0.5242878  0.07930881
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was maxdepth = 1.

Next, we looked at the confusion matrix and summary statistics for the model, which also gave an accuracy of 52.65% for this first model. Looking at these results, it became even more evident that there are no substantial numbers of recorded fires in the bigger classes D, E, F, and G.

6.2 Predicting only fire classes A and B using a decision tree

Considering the absence of class C fires and higher in the predictions, we next focused on the more frequent fire classes, A and B. We built another model on the same data, again split in an 80:20 ratio, and trained it to see whether the predictions improved. This model was also built without modifying any tuning parameters. The model's accuracy follows.

## [1] 0.5732524

Next, the confusion matrix was plotted; this model produced an accuracy of about 0.57, essentially unchanged. To further improve it, we decided to work with the rpart.control function and tune the parameters for a better fit. The confusion matrix follows.

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     A     B     C     D     E     F     G
##          A 20041 12300  1244   300   135    96    75
##          B  1827  1594   215    43    23     9     8
##          C     0     0     0     0     0     0     0
##          D     0     0     0     0     0     0     0
##          E     0     0     0     0     0     0     0
##          F     0     0     0     0     0     0     0
##          G     0     0     0     0     0     0     0
## 
## Overall Statistics
##                                           
##                Accuracy : 0.5707          
##                  95% CI : (0.5657, 0.5757)
##     No Information Rate : 0.5768          
##     P-Value [Acc > NIR] : 0.9924          
##                                           
##                   Kappa : 0.0326          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E Class: F
## Sensitivity            0.9165  0.11473  0.00000 0.000000 0.000000  0.00000
## Specificity            0.1179  0.91152  1.00000 1.000000 1.000000  1.00000
## Pos Pred Value         0.5861  0.42861      NaN      NaN      NaN      NaN
## Neg Pred Value         0.5087  0.64026  0.96151 0.990952 0.995832  0.99723
## Prevalence             0.5768  0.36650  0.03849 0.009048 0.004168  0.00277
## Detection Rate         0.5286  0.04205  0.00000 0.000000 0.000000  0.00000
## Detection Prevalence   0.9019  0.09810  0.00000 0.000000 0.000000  0.00000
## Balanced Accuracy      0.5172  0.51312  0.50000 0.500000 0.500000  0.50000
##                      Class: G
## Sensitivity          0.000000
## Specificity          1.000000
## Pos Pred Value            NaN
## Neg Pred Value       0.997811
## Prevalence           0.002189
## Detection Rate       0.000000
## Detection Prevalence 0.000000
## Balanced Accuracy    0.500000

Finally, we tried different values to manipulate the fit of the model using the rpart.control parameters. minsplit is the minimum number of observations that must exist in a node for a split to be attempted, and minbucket is the minimum number of observations allowed in any terminal leaf node. If only one of minsplit or minbucket is supplied, rpart sets minsplit to minbucket * 3 or minbucket to minsplit / 3, as the case may be. The cp parameter's primary purpose is to save computation time by skipping splits that are clearly not worthwhile: the user tells the program that any split which does not improve the fit by at least cp would most likely be pruned away by cross-validation, so it need not be pursued further. We also tuned the fit using accuracy as the selection metric. These models may have good potential for predicting fire classes, but for the larger classes the limitation is that there will most likely not be enough data to train a model we can be confident in, and it is, of course, a good thing that fires in the bigger classes are infrequent. Even after tuning, though, we could not have much confidence in the model when predicting larger fires.
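As a sketch of how these control parameters enter the call (variable names as in section 6.1; the exact formula in our tuned model may differ):

```r
library(rpart)

# Looser split rules and cp = 0 let the tree grow further before pruning
ctrl <- rpart.control(minsplit = 4,            # fewest obs. needed to try a split
                      minbucket = round(5/3),  # smallest allowed leaf
                      maxdepth = 3,            # cap on tree depth
                      cp = 0)                  # keep splits regardless of gain

tuned_tree <- rpart(FIRE_SIZE_CLASS ~ Avg_SoilMoisture + Avg_Temp +
                      Avg_Rainfall + STAT_CAUSE_CODE,
                    data = train, method = "class", control = ctrl)
```

With cp = 0, pruning is deferred entirely to cross-validation, which is why the depth cap matters here.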

6.3 Predicting fire classes A and B after tuning

Furthermore, we ran the prediction again with the tuned model to test the fit after tuning. The following parameters were used: minsplit = 4, minbucket = round(5 / 3), maxdepth = 3, cp = 0. On the same data set, the tuned model produced an accuracy of 0.5733. The confusion matrix follows.

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     A     B     C     D     E     F     G
##          A 20272 12434  1254   304   137    96    75
##          B  1596  1460   205    39    21     9     8
##          C     0     0     0     0     0     0     0
##          D     0     0     0     0     0     0     0
##          E     0     0     0     0     0     0     0
##          F     0     0     0     0     0     0     0
##          G     0     0     0     0     0     0     0
## 
## Overall Statistics
##                                           
##                Accuracy : 0.5733          
##                  95% CI : (0.5683, 0.5782)
##     No Information Rate : 0.5768          
##     P-Value [Acc > NIR] : 0.922           
##                                           
##                   Kappa : 0.0338          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E Class: F
## Sensitivity            0.9270  0.10508  0.00000 0.000000 0.000000  0.00000
## Specificity            0.1086  0.92180  1.00000 1.000000 1.000000  1.00000
## Pos Pred Value         0.5864  0.43739      NaN      NaN      NaN      NaN
## Neg Pred Value         0.5219  0.64034  0.96151 0.990952 0.995832  0.99723
## Prevalence             0.5768  0.36650  0.03849 0.009048 0.004168  0.00277
## Detection Rate         0.5347  0.03851  0.00000 0.000000 0.000000  0.00000
## Detection Prevalence   0.9119  0.08805  0.00000 0.000000 0.000000  0.00000
## Balanced Accuracy      0.5178  0.51344  0.50000 0.500000 0.500000  0.50000
##                      Class: G
## Sensitivity          0.000000
## Specificity          1.000000
## Pos Pred Value            NaN
## Neg Pred Value       0.997811
## Prevalence           0.002189
## Detection Rate       0.000000
## Detection Prevalence 0.000000
## Balanced Accuracy    0.500000

We noticed an increase in the accuracy of this model in the confusion matrix and summary statistics above (4.3.2, confusion and stats plot). Through this model we were able to predict the smaller, more frequent fire classes with a best accuracy of 57%. Predicting larger fires, however, continues to be a challenge due to the imbalance in the dataset; for the larger classes, the other models would do better. Looking closely at the many variations we tried, we noticed that the deeper nodes in almost all the models branch on soil moisture and air temperature. This suggests that soil moisture is a strong factor in determining fire class, which in another sense is the area spread and burnt by the fire. The finding is credible because, with low soil moisture, fire spreads easily along the ground even when countermeasures are in place.

7. Conclusion:

Through basic exploratory data analysis and modeling, this research found a strong relationship between environmental factors and wildfires, their burn area, and their fire size. The linear modeling helps to show that regardless of a fire's cause, whether by people or nature, its impact remains equally devastating.

Although the multinomial models were biased due to imbalanced data, we learned that smaller fires are very common across California, with over 90% of them being Class A or Class B fires. If metrics beyond the ones we used could predict these small fires, or larger ones (Class C and above), it would be a great help for fire departments across California, as they would be able to better prepare for potential fires when conditions reach a certain point. However, to achieve this, better and more balanced data must be collected so that models can more accurately predict when larger fires will occur. In addition, a large portion of fires were reported as being caused by miscellaneous reasons; reporting needs to improve for prediction ability to improve.

8. Limitations of models and Future Research:

The raw data on the environmental factors were recorded at a daily frequency; for this research we aggregated them to monthly and yearly frequencies. To make more accurate predictions about a wildfire, it would have been beneficial to look at the exact temperature, soil moisture, and rainfall at the time of the fire instead of a monthly or yearly average. In addition, our wildfire data only went up to 2015; it would have been useful to see how frequent and intense fires of the past 7 years have been, especially with the acceleration of global temperatures.

The data we used are averages for California as a whole, so they cannot capture how the environmental variables vary with latitude and longitude. Likewise, the models estimate predictions for the whole of California rather than per unit area, so they could not be applied to smaller regions of the state.

Through our EDA and modeling, we saw that we were dealing with highly imbalanced data: most of the fires recorded by the FPA (Fire Program Analysis) system fell into only a few categories. In particular, 48% of fires came from 3 main causes (debris burning, lightning, and miscellaneous) out of the 13 causes in the dataset. In addition, just over 90% of wildfires fell into classes A and B, meaning that the vast majority of recorded fires were between 0 and 9.9 acres. The classifiers we trained are sensitive to this imbalance: since most fires are in size classes A and B, a model that consistently predicts A or B maximizes accuracy, and such models are not useful for prediction. In the preprocessing for the large-fire predictor, we reduced the complexity of the prediction problem, and the logit models trained on that dataset achieved a usable score, with AUC around 0.9. We need either to collect more data categories as model inputs or to balance our data so that our models can more accurately predict wildfire causes and sizes.

In the future, it would also be interesting to take a deeper look at Class C fires and study the factors that cause larger fires and how those factors affect the burn time of these larger fires. Lastly, we would be curious to see how the analytical and prediction techniques used to study California's wildfires work for other states, such as Colorado, and even for the wildfires in Australia.

California’s fire suppression data was also not granular enough in that it did not detail where exactly the money was going. For the years that we could compare fire frequency to the budget, it did not look like the budget had any direct effect on the wildfires in California. As a result, we would recommend the State of California look into how the money is being spent. In addition, we believe that more specific data on how the budget is being spent will help with optimizing allocation to different efforts appropriately.

9. References:

1.88 million US wildfires. Kaggle. (n.d.). Retrieved May 8, 2022, from https://www.kaggle.com/rtatman/188-million-us-wildfires

Cal-Adapt. (n.d.). Data download. Retrieved May 8, 2022, from https://cal-adapt.org/data/download/

California Department of Forestry and Fire Protection (CAL FIRE). (n.d.). Cal Fire. Retrieved May 8, 2022, from https://www.fire.ca.gov/